Automated Test: kafka-consumer-parallel-after #328
Conversation
One potential problem we have with batch processing is that any one slow item will clog up the whole batch. This PR implements a queueing method instead, where we keep N queues that each have their own workers. There's still a chance of individual items backlogging a queue, but we can try increased concurrency here to reduce the chances of that happening.
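For illustration, here is a minimal sketch of the N-queue idea described above: hash the grouping key to pick a queue, so per-key ordering is preserved while different keys process in parallel. The names below are hypothetical and do not reflect the PR's actual FixedQueuePool API.

import queue
import threading
from collections.abc import Callable


class MiniQueuePool:
    """Toy N-queue pool: items with the same key always land on the same queue,
    so per-key ordering is preserved, and a slow item only backs up its own queue."""

    def __init__(self, num_queues: int, process: Callable[[object], None]) -> None:
        self.queues: list[queue.Queue] = [queue.Queue() for _ in range(num_queues)]
        for q in self.queues:
            threading.Thread(target=self._worker, args=(q, process), daemon=True).start()

    def submit(self, group_key: str, item: object) -> None:
        # A stable hash of the grouping key picks the queue for this item.
        self.queues[hash(group_key) % len(self.queues)].put(item)

    def _worker(self, q: queue.Queue, process: Callable[[object], None]) -> None:
        while True:
            process(q.get())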
📝 Walkthrough
This PR introduces a new "thread-queue-parallel" processing mode for Kafka message consumption: a multi-queue processing pipeline with per-group order preservation, thread-safe offset tracking, dedicated commit threads, and metrics integration. The implementation spans configuration updates, core queue infrastructure, consumer integration, and extensive test coverage.
Sequence Diagram
sequenceDiagram
participant Client as Message Consumer
participant Strategy as SimpleQueueProcessingStrategy
participant Decoder as Decoder Function
participant Pool as FixedQueuePool
participant Worker as OrderedQueueWorker
participant Processor as Result Processor
participant Tracker as OffsetTracker
participant CommitLoop as Commit Loop
participant Commit as Commit Function
Client->>Strategy: submit(message)
Strategy->>Decoder: decode(payload)
Decoder-->>Strategy: result (T)
Strategy->>Strategy: grouping_fn(result)
Strategy->>Pool: submit(group_key, WorkItem)
Pool->>Worker: enqueue WorkItem
Worker->>Worker: run() processes queue
Worker->>Processor: result_processor(subscription_id, result)
Processor-->>Worker: complete
Worker->>Tracker: complete_offset(partition, offset)
Tracker->>Tracker: mark offset complete
CommitLoop->>CommitLoop: periodic tick
CommitLoop->>Tracker: get_committable_offsets()
Tracker-->>CommitLoop: {partition: offset_map}
CommitLoop->>Commit: commit_function(offset_map)
Commit->>Commit: persist offsets
CommitLoop->>Tracker: mark_committed(partition, offset)
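To make the commit path in the diagram concrete, a periodic commit loop might look roughly like the sketch below. The tracker and commit interfaces are simplified assumptions based on the diagram, not the PR's exact signatures.

import threading

def commit_loop(shutdown_event: threading.Event, offset_tracker, commit_function, interval: float = 1.0) -> None:
    """Periodically flush the highest contiguous completed offset per partition."""
    while not shutdown_event.is_set():
        shutdown_event.wait(interval)
        # offset_tracker is assumed to expose get_committable_offsets() and
        # mark_committed(); commit_function takes a {partition: offset} mapping.
        committable = offset_tracker.get_committable_offsets()
        if committable:
            commit_function(committable)
            for partition, offset in committable.items():
                offset_tracker.mark_committed(partition, offset)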
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/sentry/remote_subscriptions/consumers/queue_consumer.py`:
- Around line 127-156: The run method currently always calls
offset_tracker.complete_offset in the finally block, so exceptions from
result_processor are swallowed and offsets advance even on failure; change the
logic so complete_offset(work_item.partition, work_item.offset) is only invoked
on successful processing (i.e., after result_processor returns without
exception) and not in the finally block, keep the metrics.gauge update in
finally or adjust to report failures separately, and ensure error paths (inside
the except Exception in run) do not call complete_offset so failed messages can
be retried on restart.
- Around line 344-345: The join method on the ProcessingStrategy implementation
currently ignores its timeout argument; update queue_consumer.QueueConsumer.join
to forward the timeout to the shutdown path (e.g., call self.close(timeout) or
pass timeout into the internal join/wait calls used by close) so callers like
StreamProcessor get a bounded shutdown; ensure the close signature and any
internal join methods accept and use the timeout (propagate the timeout through
close -> internal join calls).
In `@tests/sentry/remote_subscriptions/consumers/test_queue_consumer.py`:
- Around line 375-421: The test leaves the SimpleQueueProcessingStrategy's
background commit thread running because factory.shutdown() only stops the queue
pool; after creating the strategy via MockFactory.create_with_partitions you
must call strategy.close() to stop the strategy's commit thread before calling
factory.shutdown(); update
TestThreadQueueParallelIntegration.test_factory_creates_thread_queue_parallel_strategy
to invoke strategy.close() prior to factory.shutdown() to avoid the resource
leak.
In `@tests/sentry/uptime/consumers/test_results_consumer.py`:
- Around line 1874-1924: The test's assertion is flaky because
OrderedQueueWorker.run marks offsets in a finally block (so failed messages are
still completed), so update test_thread_queue_parallel_error_handling to expect
commits: after submitting two messages, wait for the queue_pool to drain and for
committed_offsets to contain test_partition, then assert
committed_offsets[test_partition] >= 101 (or that the partition exists and the
highest committed offset is the second message's offset). Use the existing
track_commits, factory.queue_pool and mock_processor_call loop to wait
deterministically for the commit loop before asserting; reference
OrderedQueueWorker.run behavior and
UptimeResultsStrategyFactory/create_with_partitions when locating the logic to
adjust the test expectation.
🧹 Nitpick comments (6)
src/sentry/remote_subscriptions/consumers/queue_consumer.py (4)
49-54: Minor: `_get_partition_lock` has a benign TOCTOU window. Two threads could both see None from `.get()` and create separate `Lock` objects, but `setdefault` is atomic under CPython's GIL, so only one lock wins and both callers get the same instance. This is safe in practice, but if you want to be explicit, you could use a single `defaultdict(threading.Lock)` for `partition_locks` or protect with a global lock.
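A minimal sketch of the `defaultdict` alternative mentioned above; the module-level layout is illustrative, not the PR's actual code:

import threading
from collections import defaultdict

# defaultdict builds the Lock lazily on first access, which makes the
# one-lock-per-partition intent explicit without a get()/setdefault() dance.
partition_locks: defaultdict[int, threading.Lock] = defaultdict(threading.Lock)

def _get_partition_lock(partition: int) -> threading.Lock:
    return partition_locks[partition]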
83-93: `get_committable_offsets` is O(offset_range), not O(tracked_offsets). The `range(start, max_offset + 1)` loop iterates over every integer in the offset range, not just tracked offsets. If a partition has a sparse set of tracked offsets (e.g., offsets 100 and 10000), this iterates ~9900 times even though only 2 offsets are tracked. In practice this is bounded by processing throughput, but worth keeping in mind if offset values become sparse.
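For illustration, an alternative that scales with the number of tracked offsets could walk them in sorted order and stop at the first outstanding one. The two sets below are an assumed simplification of the tracker's internal state, not its actual attributes:

def highest_committable(
    all_offsets: set[int], outstanding: set[int], last_committed: int
) -> int | None:
    # Walk only tracked offsets in sorted order instead of every integer in
    # range(last_committed + 1, max_offset + 1); stop at the first offset
    # that has not finished processing.
    committable = None
    for offset in sorted(all_offsets):
        if offset <= last_committed:
            continue
        if offset in outstanding:
            break
        committable = offset
    return committable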
273-291: Commit loop waits 1 second before first commit check. The `self.shutdown_event.wait(1.0)` at the top of the loop body (line 276) means the first commit check doesn't happen until ~1 second after the strategy is created, regardless of how quickly messages are processed. This is likely fine for production (it matches the `ONCE_PER_SECOND` policy), but it explains why some tests need to wait longer for commits to appear.
293-325: Exception during `submit` after `add_offset` but before `queue_pool.submit` could orphan offsets. If `self.decoder` succeeds and `self.grouping_fn` succeeds, but `self.queue_pool.submit` fails (line 315), the exception handler (lines 317-325) checks `isinstance(message.value, BrokerValue)` and adds + completes the offset. However, `queue_pool.submit` calls `offset_tracker.add_offset` before `work_queue.put` (lines 211-212). If `put` raises, the offset is added but never completed, and the exception handler adds it again. The `add_offset` call at line 211 adds the offset to `all_offsets` and `outstanding`, then the handler at lines 320-325 adds it again (it's a set, so no duplicate) and completes it. This is actually safe because sets handle duplicates, but the flow is subtle and could benefit from a comment.
src/sentry/remote_subscriptions/consumers/result_consumer.py (1)
205-212: Consider using `elif` for the mode dispatch chain. The mode flags are mutually exclusive, but the current `if`/`if`/`if`/`else` structure doesn't make that obvious. Using `elif` would make the intent clearer and prevent accidental fall-through if a future change removes an early `return`.
Proposed refactor
 def create_with_partitions(
     self,
     commit: Commit,
     partitions: Mapping[Partition, int],
 ) -> ProcessingStrategy[KafkaPayload]:
     if self.batched_parallel:
         return self.create_thread_parallel_worker(commit)
-    if self.parallel:
+    elif self.parallel:
         return self.create_multiprocess_worker(commit)
-    if self.thread_queue_parallel:
+    elif self.thread_queue_parallel:
         return self.create_thread_queue_parallel_worker(commit)
     else:
         return self.create_serial_worker(commit)
tests/sentry/uptime/consumers/test_results_consumer.py (1)
1824-1872: Polling loop could time out without asserting failure. Lines 1860-1867: if the loop times out (queue never empties or no commits appear), execution falls through to the assertions at lines 1869-1870, which would fail with a less informative error. Consider adding an explicit timeout assertion like the other tests do: `assert ..., "Processing did not complete in time"`.
Proposed improvement
 queue_pool = factory.queue_pool
 max_wait = 20
+completed = False
 for _ in range(max_wait):
     assert queue_pool is not None
     stats = queue_pool.get_stats()
     if stats["total_items"] == 0 and len(committed_offsets) > 0:
+        completed = True
         break
-    time.sleep(0.1)
+assert completed, "Processing or commit did not complete in time"
def run(self) -> None:
    """Process items from the queue in order."""
    while not self.shutdown:
        try:
            work_item = self.work_queue.get()
        except queue.ShutDown:
            break

        try:
            with sentry_sdk.start_transaction(
                op="queue_worker.process",
                name=f"monitors.{self.identifier}.worker_{self.worker_id}",
            ):
                self.result_processor(self.identifier, work_item.result)

        except queue.ShutDown:
            break
        except Exception:
            logger.exception(
                "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
            )
        finally:
            self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
            metrics.gauge(
                "remote_subscriptions.queue_worker.queue_depth",
                self.work_queue.qsize(),
                tags={
                    "identifier": self.identifier,
                },
            )
Errors in processing are silently swallowed, and the offset still advances.
complete_offset is called in the finally block (line 149), so even when result_processor raises an exception, the offset is marked as completed and will be committed. This means failed messages are permanently skipped with only a log entry. If this is intentional (at-most-once semantics), consider documenting it explicitly. If not, the offset should only be completed on success to allow re-processing on restart.
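If re-processing after a restart is the desired behavior, the loop could complete the offset only on the success path, roughly like the sketch below. This is an illustrative rewrite of the quoted method (the sentry_sdk transaction wrapper is omitted for brevity), not the PR's actual change:

def run(self) -> None:
    """Process items from the queue in order; offsets advance only on success."""
    while not self.shutdown:
        try:
            work_item = self.work_queue.get()
        except queue.ShutDown:
            break

        try:
            self.result_processor(self.identifier, work_item.result)
        except Exception:
            # Leave the offset outstanding so it is not committed and the
            # message can be re-processed after a restart.
            logger.exception(
                "Unexpected error in queue worker", extra={"worker_id": self.worker_id}
            )
        else:
            # Only a successful result_processor call advances the offset.
            self.offset_tracker.complete_offset(work_item.partition, work_item.offset)
        finally:
            metrics.gauge(
                "remote_subscriptions.queue_worker.queue_depth",
                self.work_queue.qsize(),
                tags={"identifier": self.identifier},
            )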
🤖 Prompt for AI Agents
In `@src/sentry/remote_subscriptions/consumers/queue_consumer.py` around lines 127
- 156, The run method currently always calls offset_tracker.complete_offset in
the finally block, so exceptions from result_processor are swallowed and offsets
advance even on failure; change the logic so
complete_offset(work_item.partition, work_item.offset) is only invoked on
successful processing (i.e., after result_processor returns without exception)
and not in the finally block, keep the metrics.gauge update in finally or adjust
to report failures separately, and ensure error paths (inside the except
Exception in run) do not call complete_offset so failed messages can be retried
on restart.
def join(self, timeout: float | None = None) -> None:
    self.close()
join() ignores the timeout parameter.
The timeout parameter from the ProcessingStrategy interface is unused. Callers (like StreamProcessor) may rely on it to bound shutdown time. Consider passing timeout through to close() or at least to the internal join calls.
Proposed fix
-    def join(self, timeout: float | None = None) -> None:
-        self.close()
+    def join(self, timeout: float | None = None) -> None:
+        self.shutdown_event.set()
+        self.commit_thread.join(timeout=timeout)
+        self.queue_pool.shutdown()
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 344-344: Unused method argument: timeout
(ARG002)
🤖 Prompt for AI Agents
In `@src/sentry/remote_subscriptions/consumers/queue_consumer.py` around lines 344
- 345, The join method on the ProcessingStrategy implementation currently
ignores its timeout argument; update queue_consumer.QueueConsumer.join to
forward the timeout to the shutdown path (e.g., call self.close(timeout) or pass
timeout into the internal join/wait calls used by close) so callers like
StreamProcessor get a bounded shutdown; ensure the close signature and any
internal join methods accept and use the timeout (propagate the timeout through
close -> internal join calls).
class TestThreadQueueParallelIntegration(TestCase):
    """Integration test with the ResultsStrategyFactory."""

    def test_factory_creates_thread_queue_parallel_strategy(self):
        """Test that the factory properly creates thread-queue-parallel strategy."""
        from sentry.remote_subscriptions.consumers.result_consumer import (
            ResultProcessor,
            ResultsStrategyFactory,
        )

        class MockResultProcessor(ResultProcessor):
            @property
            def subscription_model(self):
                return mock.Mock()

            def get_subscription_id(self, result):
                return result.get("subscription_id", "unknown")

            def handle_result(self, subscription, result):
                pass

        class MockFactory(ResultsStrategyFactory):
            @property
            def topic_for_codec(self):
                return Topic("test")

            @property
            def result_processor_cls(self):
                return MockResultProcessor

            def build_payload_grouping_key(self, result):
                return result.get("subscription_id", "unknown")

            @property
            def identifier(self):
                return "test"

        factory = MockFactory(mode="thread-queue-parallel", max_workers=5)
        commit = mock.Mock()
        partition = Partition(Topic("test"), 0)
        strategy = factory.create_with_partitions(commit, {partition: 0})

        assert isinstance(strategy, SimpleQueueProcessingStrategy)
        assert factory.queue_pool is not None
        assert factory.queue_pool.num_queues == 5

        factory.shutdown()
Resource leak: strategy commit thread is never stopped.
The SimpleQueueProcessingStrategy returned at line 415 starts a daemon commit thread. factory.shutdown() at line 421 shuts down the queue pool but doesn't stop the strategy's commit thread. Call strategy.close() before factory.shutdown() to clean up properly.
Proposed fix
 assert isinstance(strategy, SimpleQueueProcessingStrategy)
 assert factory.queue_pool is not None
 assert factory.queue_pool.num_queues == 5
+strategy.close()
 factory.shutdown()
🤖 Prompt for AI Agents
factory.shutdown()🤖 Prompt for AI Agents
In `@tests/sentry/remote_subscriptions/consumers/test_queue_consumer.py` around
lines 375 - 421, The test leaves the SimpleQueueProcessingStrategy's background
commit thread running because factory.shutdown() only stops the queue pool;
after creating the strategy via MockFactory.create_with_partitions you must call
strategy.close() to stop the strategy's commit thread before calling
factory.shutdown(); update
TestThreadQueueParallelIntegration.test_factory_creates_thread_queue_parallel_strategy
to invoke strategy.close() prior to factory.shutdown() to avoid the resource
leak.
def test_thread_queue_parallel_error_handling(self) -> None:
    """
    Test that errors in processing don't block offset commits for other messages.
    """
    committed_offsets: dict[Partition, int] = {}

    def track_commits(offsets: Mapping[Partition, int], force: bool = False) -> None:
        committed_offsets.update(offsets)

    factory = UptimeResultsStrategyFactory(
        mode="thread-queue-parallel",
        max_workers=2,
    )

    test_partition = Partition(Topic("test"), 1)
    consumer = factory.create_with_partitions(track_commits, {test_partition: 0})

    with mock.patch.object(type(factory.result_processor), "__call__") as mock_processor_call:
        mock_processor_call.side_effect = [Exception("Processing failed"), None]

        codec = kafka_definition.get_topic_codec(kafka_definition.Topic.UPTIME_RESULTS)

        for offset, minutes in [(100, 5), (101, 4)]:
            result = self.create_uptime_result(
                self.subscription.subscription_id,
                scheduled_check_time=datetime.now() - timedelta(minutes=minutes),
            )
            message = Message(
                BrokerValue(
                    KafkaPayload(None, codec.encode(result), []),
                    test_partition,
                    offset,
                    datetime.now(),
                )
            )
            consumer.submit(message)

        queue_pool = factory.queue_pool
        max_wait = 20
        for _ in range(max_wait):
            assert queue_pool is not None
            stats = queue_pool.get_stats()
            if stats["total_items"] == 0 and mock_processor_call.call_count >= 2:
                time.sleep(0.2)
                break
            time.sleep(0.1)

        assert mock_processor_call.call_count == 2
        assert len(committed_offsets) == 0 or test_partition not in committed_offsets

    factory.shutdown()
Error handling test assertion may be incorrect or timing-dependent.
The OrderedQueueWorker.run method completes the offset in a finally block (line 149 of queue_consumer.py), meaning even on exception, the offset is marked as processed. Both offsets (100 and 101) should be completed and eventually committed by the commit loop. The assertion at line 1922 (len(committed_offsets) == 0 or test_partition not in committed_offsets) only passes if the commit loop hasn't executed yet — a timing-dependent condition that makes this test flaky.
Either:
- The `finally`-based offset completion is wrong and should only happen on success (in which case this test is correct, but the implementation has the bug I flagged in queue_consumer.py).
- The implementation is intentional (at-most-once), and this assertion should expect the offsets to be committed (see the sketch below).
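If the at-most-once behavior is kept, the test's final assertions could wait for the commit and expect it, roughly as sketched here. Variable names follow the quoted test; the wait-loop bounds are illustrative:

# Wait for the commit loop instead of racing it.
for _ in range(50):
    if test_partition in committed_offsets:
        break
    time.sleep(0.1)

assert mock_processor_call.call_count == 2
# Under at-most-once semantics both offsets complete, so the committed
# offset should reach at least the second message's offset.
assert committed_offsets.get(test_partition, 0) >= 101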
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 1880-1880: Unused function argument: force
(ARG001)
🤖 Prompt for AI Agents
In `@tests/sentry/uptime/consumers/test_results_consumer.py` around lines 1874 -
1924, The test's assertion is flaky because OrderedQueueWorker.run marks offsets
in a finally block (so failed messages are still completed), so update
test_thread_queue_parallel_error_handling to expect commits: after submitting
two messages, wait for the queue_pool to drain and for committed_offsets to
contain test_partition, then assert committed_offsets[test_partition] >= 101 (or
that the partition exists and the highest committed offset is the second
message's offset). Use the existing track_commits, factory.queue_pool and
mock_processor_call loop to wait deterministically for the commit loop before
asserting; reference OrderedQueueWorker.run behavior and
UptimeResultsStrategyFactory/create_with_partitions when locating the logic to
adjust the test expectation.
This pull request was automatically created by @coderabbitai/e2e-reviewer.
Batch created pull request.